FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
Authors
Abstract
Content-Defined Chunking (CDC) has played a key role in data deduplication systems over the past 15 years or so because of its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they determine chunk cut-points by computing and judging rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a fast and efficient CDC approach that builds and improves on the latest Gear-based CDC approach, one of the fastest CDC methods to our knowledge. The key idea behind FastCDC is the combined use of three techniques: simplifying and enhancing the hash judgment to address the challenges we observed in Gear-based CDC, skipping sub-minimum chunk cut-points to further speed up CDC, and normalizing the chunk-size distribution in a small specified region to offset the drop in deduplication ratio caused by cut-point skipping. Our evaluation results show that, by combining the three techniques, FastCDC is about 10× faster than the best open-source Rabin-based CDC and about 3× faster than the state-of-the-art Gear- and AE-based CDC, while achieving nearly the same deduplication ratio as the classic Rabin-based approach.
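The three techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no concrete parameters, so the size limits, the two masks, the seeded Gear table, and the function names `next_cut` and `chunk` are all assumptions chosen for illustration. The sketch uses a Gear-style rolling hash (shift, then add a per-byte random value), starts hashing only after the minimum chunk size, and switches from a stricter mask to a looser one once the chunk passes a target size, which pulls chunk sizes toward that target.

```python
import random

# All values below are illustrative assumptions, not taken from the paper.
MIN_SIZE = 2 * 1024       # no cut-point is tested before this many bytes
NORMAL_SIZE = 8 * 1024    # target size where the mask is relaxed
MAX_SIZE = 64 * 1024      # a cut is forced at this size
MASK_STRICT = (1 << 15) - 1  # more bits must be zero: cuts are rarer
MASK_LOOSE = (1 << 11) - 1   # fewer bits must be zero: cuts are likelier
HASH_BITS = (1 << 64) - 1    # keep the hash within 64 bits

# Gear table: one pseudo-random 64-bit value per byte value (fixed seed
# so that chunking is reproducible across runs).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_cut(data: bytes, start: int) -> int:
    """Return the end offset (exclusive) of the chunk beginning at start."""
    remaining = len(data) - start
    if remaining <= MIN_SIZE:
        return len(data)
    end = start + min(remaining, MAX_SIZE)
    normal = start + min(remaining, NORMAL_SIZE)
    h = 0
    i = start + MIN_SIZE  # skip the sub-minimum region entirely
    # Below the target size, use the strict mask.
    while i < normal:
        h = ((h << 1) + GEAR[data[i]]) & HASH_BITS
        if h & MASK_STRICT == 0:
            return i + 1
        i += 1
    # Past the target size, use the loose mask.
    while i < end:
        h = ((h << 1) + GEAR[data[i]]) & HASH_BITS
        if h & MASK_LOOSE == 0:
            return i + 1
        i += 1
    return end  # no cut-point found: cut at MAX_SIZE or end of data


def chunk(data: bytes) -> list:
    """Split data into content-defined chunks."""
    chunks = []
    pos = 0
    while pos < len(data):
        cut = next_cut(data, pos)
        chunks.append(data[pos:cut])
        pos = cut
    return chunks
```

Calling `chunk(payload)` on a `bytes` object returns a list of chunks whose concatenation equals the input; in a deduplication system each chunk would then be fingerprinted and looked up in an index. The hash judgment here is a single mask-and-compare per byte, which is the kind of cheap test the abstract contrasts with heavier rolling-hash checks.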
Similar resources
Leap-based Content Defined Chunking - Theory and Implementation
Content Defined Chunking (CDC) is an important component in data deduplication, which affects both the deduplication ratio as well as deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
Ddelta: A deduplication-inspired fast delta compression approach
Delta compression is an efficient data reduction approach to removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed, a worsening problem in the face of the steadily increasing storage and network bandwidth and speed. In this paper, we present Ddelta, a deduplication-inspired fast delta compression sch...
Accelerating Data Deduplication by Exploiting Pipelining and Parallelism with Multicore or Manycore Processors
As the amount of digital data grows explosively, data deduplication has gained increasing attention for its space-efficient functionality that not only reduces the storage space requirement by eliminating duplicate data but also minimizes the transmission of redundant data in data-intensive storage systems. Most existing state-of-the-art deduplication methods remove redundant data at either ...
A Dynamic Deduplication Approach for Big Data Storage
As data increases every day, managing storage devices for this explosive growth of digital data is a very challenging task. Data reduction has become a crucial problem. Deduplication plays a vital role in removing redundancy in large-scale cluster computing storage. As a result, deduplication provides better storage utilization by eliminating redundant copies of data and savi...
Bimodal Content Defined Chunking for Backup Streams
Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC incl...